A compelling demonstration of why traditional statistical regression models cannot be used to identify risk factors from case data on infectious diseases: a simulation study

Background Regression models are often used to explain the relative risk of infectious diseases among groups. For example, overrepresentation of immigrants among COVID-19 cases has been found in multiple countries. Several studies apply regression models to investigate whether different risk factors can explain this overrepresentation among immigrants without considering dependence between the cases. Methods We study the appropriateness of traditional statistical regression methods for identifying risk factors for infectious diseases, by a simulation study. We model infectious disease spread by a simple, population-structured version of an SIR (susceptible-infected-recovered)-model, which is one of the most famous and well-established models for infectious disease spread. The population is thus divided into different sub-groups. We vary the contact structure between the sub-groups of the population. We analyse the relation between individual-level risk of infection and group-level relative risk. We analyse whether Poisson regression estimators can capture the true, underlying parameters of transmission. We assess both the quantitative and qualitative accuracy of the estimated regression coefficients. Results We illustrate that there is no clear relationship between differences in individual characteristics and group-level overrepresentation —small differences on the individual level can result in arbitrarily high overrepresentation. We demonstrate that individual risk of infection cannot be properly defined without simultaneous specification of the infection level of the population. We argue that the estimated regression coefficients are not interpretable and show that it is not possible to adjust for other variables by standard regression methods. Finally, we illustrate that regression models can result in the significance of variables unrelated to infection risk in the constructed simulation example (e.g. ethnicity), particularly when a large proportion of contacts is within the same group. Conclusions Traditional regression models which are valid for modelling risk between groups for non-communicable diseases are not valid for infectious diseases. By applying such methods to identify risk factors of infectious diseases, one risks ending up with wrong conclusions. Output from such analyses should therefore be treated with great caution. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01565-1.


S1.1 Expression for reproduction number
We provide the expression for the basic reproduction number, 0 , in the setting with four subgroups, based on the largest eigenvalue of the next-generation matrix [1]. In this setting the nextgeneration matrix, , is defined by: We then choose a value of the overall susceptibility such that we can run with the desired basic reproduction number. The value is found numerically.

S1.2 Estimating from the data-generation-model using ABC
We investigate whether a simple rejection Markov Chain Monte Carlo approximate Bayesian computation algorithm (ABC-MCMC) [2] can estimate the true effect of ethnicity in the case 3 setting with four sub-groups and = 1.2. We define the ethnicity effect by the parameter such that ⋅ ℎ = ℎ = • ⋅ = ⋅ , that is, the high-risk group has an increased susceptibility by a factor = 1.2 and ethnicity group has an increased susceptibility by a factor . The aim is to understand whether it is possible to obtain the correct parameters when the data-generating model is taken into account. The idea behind ABC is to obtain parameters which provide simulations that are close to the observed data. The observed data are taken from a simulation with the model. We will assume that all parameters except and the ethnicity assortativity are known. We assume that the basic reproduction number is known, and hence calculate from the reproduction number. Hence, this is an ideal situation for the parameter estimation, and the results are likely to be less accurate in a real data situation. We assume two parameter values for , = 1.0 and = 1.05 and that the ethnicity assortativity is 10. We use a basic reproduction number of 1.3 and the contact structure between risk groups ( ℎℎ ℎ ℎ ) = ( 1/2 1/2 1/2 1/2 ). We also assume the same total number of contacts in all four groups, so ℎ = = ℎ = = 1. We first estimate both and the ethnicity assortativity, then we consider the case with a known assortativity and only estimate . We assume independent priors. For we assume a rather wide normal prior distribution with mean 1.0 and variance 9, truncated at 0. For the assortativity we assume a uniform prior from 0.05 to 20. We use the ABC-MCMC algorithm proposed in [2] and refer to that paper for the algorithm. We are also inspired by the review in [3]. The algorithm requires a starting value 0 , a proposal distribution, a distance measure between simulations and observations, and an acceptance threshold . We use starting values 1.2 for and 5 for the assortativity. We assume an independent normal proposal distribution centred at the current value of the parameters, with variance 0.2 for and 4 for the assortativity. We denote the proposal distribution by ( ) where is the current parameter value. As our distance measure, we compute the sum of the absolute deviance between the simulated and observed proportion of infected in ethnicity groups and . We use a threshold of = 0.001. The idea is that we start with proposing ~( −1 ). If has 0 prior probability, we sample again. We then simulate with and compute the distance between the simulations and observations. If the distance is below , the parameters are accepted with a probability which depends on the proposal distribution and the prior, and we set = . Otherwise = −1 . We use chains of length 1 000 000 and a burn-in of 100 000. As the true observations, we simulate from the model 100 times and hence perform 100 parameter calibrations in each setting.

S2.1.1 Case 1
The time series of infected for case 1 is provided in Figure S1, for some selected values of the reproduction number.

S2.1.2 Case 2
The time series of infected for case 2 is provided in Figure S2, for some selected values of the relative susceptibility in the high-risk group ( ) to the low-risk group ( ), . We note that the disease dynamics and total number infected for the low-risk group depends on , even though only affects the individual-level risks in the high-risk group.

S2.2 Results from ABC parameter estimation
We provide the results from the parameter estimation using ABC-MCMC. We study the setting where we estimate both the assortativity and , and when we assume that the assortativity is known and estimate only . Figure S3 shows the histogram of estimated parameters in the different settings. The histogram is based on the 900 000 samples for each of the 100 simulations. We note that the estimated is more accurate when we assume a known assortativity, but the mean value seems to be well captured in all four settings. We also note that the assortativity is not well estimated.
The estimated together with the 95%-credible interval from each simulation is provided in Figure  S4. We note that most of the credible intervals are centred at the correct value, but also that the intervals vary between the simulations, most likely due to the stochasticity of the disease spread model. Figure S4. Estimated and 95% credible interval for each of the 100 simulations.